Tool response quality, turn tracing, agent eval CLI — release 0.7.0 #6
Merged
…t convention
Models trained on Claude Code's Edit tool send file_edit({old_text: "",
new_text: <content>}) to create new files. SmallHarness's file_edit
previously returned "File not found" immediately, forcing a retry loop
(mkdir → touch → file_edit) that wasted 2-3 extra API round-trips.
Now: a single edit with old_text="" on a missing file creates the file
(including parent dirs), matching the Claude Code convention exactly.
Non-creation cases that hit "not found" or "old_text is empty" now also
include "Use file_write to create new files" for faster recovery.
Adds two tests: one for the new creation path, one confirming the
non-empty-old_text-on-missing-file error still fires with the hint.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
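The creation path described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not SmallHarness's actual code: the function signature, the string-based error type, and the exact message wording are hypothetical, and only the hint text "Use file_write to create new files" comes from the commit message.

```rust
use std::fs;
use std::path::Path;

// Hint text quoted from the commit message.
const HINT: &str = "Use file_write to create new files";

/// Hypothetical, simplified file_edit. Returns a status message on success
/// or an error string intended for the model on failure.
pub fn file_edit(path: &Path, old_text: &str, new_text: &str) -> Result<String, String> {
    if !path.exists() {
        if old_text.is_empty() {
            // Empty old_text on a missing file: create it, including parent dirs.
            if let Some(parent) = path.parent().filter(|p| !p.as_os_str().is_empty()) {
                fs::create_dir_all(parent).map_err(|e| e.to_string())?;
            }
            fs::write(path, new_text).map_err(|e| e.to_string())?;
            return Ok(format!("Created {}", path.display()));
        }
        // Non-empty old_text on a missing file is still an error, now with the hint.
        return Err(format!("File not found: {}. {}", path.display(), HINT));
    }
    if old_text.is_empty() {
        return Err(format!("old_text is empty. {}", HINT));
    }
    let content = fs::read_to_string(path).map_err(|e| e.to_string())?;
    if !content.contains(old_text) {
        return Err("old_text not found in file".to_string());
    }
    // Replace only the first occurrence (an assumption made for this sketch).
    let updated = content.replacen(old_text, new_text, 1);
    fs::write(path, updated).map_err(|e| e.to_string())?;
    Ok(format!("Edited {}", path.display()))
}
```

With this shape, a model that sends an empty old_text for a missing path gets the file in one call, while every other "not found" case returns an error that names the recovery tool.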
file_read: offset past EOF now returns a clear error instead of silently
returning empty content, which caused models to think files were empty
and retry with different offsets.
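A rough sketch of the offset check, assuming the offset is a zero-based line index (the commit message does not say whether offsets count lines or bytes). The function name and error wording are hypothetical.

```rust
/// Hypothetical helper: return the content starting at a zero-based line
/// offset, or an explicit error when the offset is past the end of the file.
pub fn read_from_line(content: &str, offset: usize) -> Result<String, String> {
    let total = content.lines().count();
    // Offset 0 is always allowed, so an empty file still reads as empty.
    if offset > 0 && offset >= total {
        return Err(format!(
            "offset {} is past end of file ({} lines)",
            offset, total
        ));
    }
    Ok(content.lines().skip(offset).collect::<Vec<_>>().join("\n"))
}
```

Returning an error here lets the caller distinguish "this file is empty" from "you asked for a region that does not exist".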
list_dir: add "total" field to every response so models know the real
directory size when truncated (count capped at 500 but total reflects
the actual entry count).
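The count/total distinction might look like the sketch below. The struct, its field names other than "total", the `truncated` flag, and the sorting step are assumptions for illustration; only the 500-entry cap and the "total" field come from the commit message.

```rust
/// Cap taken from the commit message.
pub const MAX_ENTRIES: usize = 500;

/// Hypothetical response shape for list_dir.
#[derive(Debug)]
pub struct DirListing {
    pub entries: Vec<String>,
    /// Number of entries actually returned (at most MAX_ENTRIES).
    pub count: usize,
    /// Number of entries in the directory before truncation.
    pub total: usize,
    pub truncated: bool,
}

pub fn build_listing(mut names: Vec<String>) -> DirListing {
    // Record the real size before dropping anything.
    let total = names.len();
    names.sort();
    names.truncate(MAX_ENTRIES);
    DirListing {
        count: names.len(),
        truncated: total > MAX_ENTRIES,
        entries: names,
        total,
    }
}
```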
grep: switch map → filter_map so unparseable rg output lines (e.g.
binary-file notices) are dropped rather than emitted as malformed
{content: "..."} objects missing the file and line fields. Also moves
.take(100) after filter_map to ensure up to 100 *parseable* matches.
Adds the first test module for grep.rs (6 new tests total across the
three files).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
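The grep change can be illustrated with a small sketch. It assumes rg output lines of the form `path:line:content`; the `Match` struct and both function names are hypothetical, and the parser is deliberately naive (it would mis-split paths that contain a colon).

```rust
#[derive(Debug, PartialEq)]
pub struct Match {
    pub file: String,
    pub line: u64,
    pub content: String,
}

/// Parse one `path:line:content` line; returns None for anything else,
/// such as a binary-file notice.
fn parse_line(raw: &str) -> Option<Match> {
    let mut parts = raw.splitn(3, ':');
    let file = parts.next()?;
    let line = parts.next()?.parse().ok()?;
    let content = parts.next()?;
    Some(Match {
        file: file.to_string(),
        line,
        content: content.to_string(),
    })
}

pub fn parse_matches(output: &str, limit: usize) -> Vec<Match> {
    output
        .lines()
        // filter_map drops unparseable lines instead of emitting partial objects.
        .filter_map(parse_line)
        // take() comes after filter_map, so the limit counts parseable matches only.
        .take(limit)
        .collect()
}
```

Placing `take` before `filter_map` would instead cap the number of raw lines examined, so dropped lines would eat into the limit.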
…y, session
commands/mod.rs had grown past 3,000 lines. Move the command handlers into four focused submodules, leaving dispatch and the command list in mod.rs:
- config_cmds (/config, /backend, /model, /verbose…)
- context_cmds (/context, /compact, /reset, /checkpoints)
- memory (/index, /map, /memory, /remember, /forget)
- session (/new, /undo, /session, /resume, /export, /path)
No behavior change.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add turn_trace: every turn appends structured events (tool calls with redacted args, approvals, compaction, warmup, timing) to a sidecar at .sessions/<session-id>.events.jsonl, enabled by default via display.eventLog.enabled. API keys and sensitive object keys are redacted before anything is written.
/trace on|off surfaces nested subagent/critic tool calls as indented lines in the TUI without flooding the parent context; previously their activity was invisible (events swallowed). Tool calls now carry a depth field, and the subagent/evaluator tools forward their inner events when tracing is on.
The end-of-turn status line gains a timing breakdown (TTFT, model, tools, approval, total), the loader shows which tool is running, compaction of oversized tool output is now reported to the user with the original size, and /export <session> events copies the event log. Also prints a context pressure notice as the prompt budget nears the model's effective limit.
/export current events copies the sidecar; /new and /resume reset it to the active session.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
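As an illustration of redacting sensitive object keys before an event is written, here is a minimal sketch over a flat string map. The key list, the placeholder string, and the function name are all assumptions; the actual redaction rules are not described in the commit message.

```rust
use std::collections::BTreeMap;

// Hypothetical list of key fragments treated as sensitive.
const SENSITIVE_FRAGMENTS: [&str; 4] = ["api_key", "token", "password", "secret"];

/// Return a copy of the tool-call arguments with sensitive values replaced,
/// so the original map is never written to the event log.
pub fn redact_args(args: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    args.iter()
        .map(|(key, value)| {
            let lower = key.to_lowercase();
            let sensitive = SENSITIVE_FRAGMENTS.iter().any(|frag| lower.contains(frag));
            let shown = if sensitive {
                "[REDACTED]".to_string()
            } else {
                value.clone()
            };
            (key.clone(), shown)
        })
        .collect()
}
```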
…CI job
small-harness --eval <fixture> [--model M] [--json] runs a bundled agent eval fixture from the shell and exits 0 on pass / 1 on fail, so evals can gate CI. A new optional macOS CI job runs two fixtures against Ollama nightly or when a commit message contains [eval]; it is continue-on-error so a flaky local model never blocks merges.
Add agent_integration_test: drives the real agent loop against a mock OpenAI-compatible SSE server (no live LLM), covering a tool-call round trip plus eval checks and the hit_step_limit cutoff flag.
Two fixes surfaced while wiring this up:
- The rubric heading parser now matches "(weight:" case-insensitively on the raw bytes instead of reusing byte offsets from a lowercased copy, which can diverge from the original for some Unicode characters.
- The HTTP client gets a 10s connect timeout so a dead backend fails fast instead of hanging, without capping long streaming completions.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
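The rubric parser fix can be sketched as a byte-level, ASCII case-insensitive search over the original string. The function name is hypothetical; the point is that the returned offset indexes the original text, whereas an offset found in a lowercased copy may not, because lowercasing can change the byte length of some characters.

```rust
/// Find the first ASCII case-insensitive occurrence of `needle` in `haystack`,
/// returning a byte offset into the ORIGINAL string.
/// Intended for ASCII needles such as "(weight:".
pub fn find_ascii_ci(haystack: &str, needle: &str) -> Option<usize> {
    let hay = haystack.as_bytes();
    let pat = needle.as_bytes();
    if pat.is_empty() {
        // windows(0) would panic, so handle the empty needle explicitly.
        return Some(0);
    }
    hay.windows(pat.len())
        .position(|window| window.eq_ignore_ascii_case(pat))
}
```

Because an ASCII needle can only match ASCII bytes, the returned offset always falls on a character boundary, so slicing the original heading at that offset is safe.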
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
- fix: tool response quality for three model-facing edge cases (file_read offset past EOF, list_dir truncation total, grep unparseable lines)
- refactor: split the 3,000-line commands/mod.rs into config_cmds, context_cmds, memory, session
- feat: turn tracing (/trace view, .sessions/<id>.events.jsonl event log, turn timing footer)
- feat: agent eval CLI (--eval) with mock-SSE integration tests and an optional nightly CI job
- chore: release 0.7.0 (Cargo version, CHANGELOG, README badge)
Each commit compiles and passes cargo test, clippy, and fmt independently.
Merging with a merge commit (not squash) to keep the per-feature commits; v0.7.0 will be tagged on the release commit after merge, which triggers the release workflow and the Homebrew tap update.
🤖 Generated with Claude Code